KIISE Transactions on Computing Practices (정보과학회 컴퓨팅의 실제 논문지)
Current Result Document :
Korean Title |
생성적 적대 신경망과 데이터 확장을 이용한 딥러닝 기반 TTS 음질 개선 |
English Title |
Fidelity Enhancement for Deep Learning-based TTS using a Generative Adversarial Network and Data Augmentation |
Author |
최 진
양진혁
김인중
Jin Choi
Jinhyeok Yang
Injung Kim
|
Citation |
Vol. 26, No. 5, pp. 256-260 (May 2020) |
Korean Abstract (translated) |
This paper introduces TE-GAN (TTS Enhancement GAN), a deep learning model that uses a generative adversarial network to refine the mel spectrogram synthesized by a deep learning-based TTS model so that it resembles the mel spectrogram of real speech. TE-GAN is designed with the characteristics of speech signals in mind and delivers a strong fidelity improvement even when combined with a simple vocoder such as the Griffin-Lim algorithm. In addition, we propose a data augmentation method based on temporal multi-agents (TMA) for effective training of TE-GAN. Experiments show that the proposed methods can greatly improve the fidelity of speech synthesized by a TTS system. In the experiments, TE-GAN refined the mel spectrum synthesized by Tacotron to resemble the mel spectrum of real speech, and the MOS of the synthesized speech improved substantially, from 2.07 to 3.24.
|
English Abstract |
In this paper, we introduce TE-GAN (TTS Enhancement GAN), a deep learning model that uses a generative adversarial network to enhance the Mel-spectrogram synthesized by a deep learning-based TTS model so that it resembles that of human speech. TE-GAN is designed around the characteristics of speech signals and significantly improves the fidelity of speech signals even when combined with a simple vocoder such as the Griffin-Lim algorithm. Additionally, we present a data augmentation technique using a Temporal Multi-Agent (TMA) approach for effective learning. Experimental results demonstrate that the proposed methods significantly improve the fidelity of the speech signals synthesized by the TTS system. In experiments, TE-GAN made the Mel-spectrogram synthesized by Tacotron more similar to that of human speech, and the MOS of the synthesized speech improved significantly from 2.07 to 3.24.
|
Keyword |
딥러닝
음성합성
생성적 적대 신경망
데이터 확장
deep learning
speech synthesis
generative adversarial network
TTS 음질 개선
data augmentation
TTS fidelity enhancement
|